We will be exploring a data set of various beers and breweries provided by Anheuser Busch to assess the relationship between International Bitterness Units (IBU) and Alcohol By Volume (ABV). We will also seek to share other insights from this data related to IBU and ABV such as which beers have the highest IBU and ABV.
# read data
# GETTING DATA FROM FILES STORED IN MICROSOFT AZURE CONTAINER
breweries = read.csv(
(textConnection(
getURL("https://cs79521a4cc02bcx4125xab3.blob.core.windows.net/mixone-box/Breweries.csv"))),header = TRUE)
beers = read.csv(
(textConnection(
getURL("https://cs79521a4cc02bcx4125xab3.blob.core.windows.net/mixone-box/Beers.csv"))),header = TRUE)
The top three states with the most breweries in our data set are California, Colorado, and Michigan. It is not surprising for California to top this list due to their population but Colorado and Michigan have considerably lower populations yet still maintain a higher number of breweries indicating that population is not the only factor that determines the number of breweries in a state. It’s likely that certain cultural and economic aspects of the population influence the number of breweries in that state as well. Furthermore, the mpa of the US shows that Breweries do not dominate any particular region of the US.
breweries$State <- trimws(breweries$State)
# use lat/lon data from us_map package to estomate center of state
coordinates<-us_map(regions = "states")
coordinates = coordinates %>% dplyr::group_by(abbr) %>%
dplyr::summarize(Lat = mean(y), Lon = mean(x))
#sum breweries by state. "fips" is a key for us states
state_breweries<-breweries %>% dplyr::group_by(State) %>%
dplyr::summarise(Count_of_Breweries = n()) %>%
dplyr::mutate(fips = fips(State)) %>%
inner_join(coordinates, by=c("State" = "abbr"))
#plot us map
plot_usmap(data=state_breweries, values="Count_of_Breweries", color="black") +
scale_fill_continuous(low="white", high="red", name="Number of Breweries") +
theme(legend.position = "right") +
ggtitle("Breweries in the United States")
# bar chart of states
breweries %>% dplyr::group_by(State) %>%
dplyr::summarize(count = n()) %>%
ggplot(aes(x=reorder(State, count), y = count)) +
geom_bar(stat="identity", width=.5, fill="orange") +
labs(title="Number of Breweries Per State",
x = "State",
y = "Brewereis Count") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
IBU are not regularly reported for beers indicated in the plot below. Very few measurements of Alcohol by Volume are missing but missing IBU accounts for about 40% of all beers in this data set.
#Force any empty strings to become NA.
#Style is the only character vector with empty strings
beer_merged$Style[beer_merged$Style==""]=NA
#Create data frame for missing values plot
missing_values2 <- beer_merged %>%
gather(key = "key", value = "val") %>%
dplyr::mutate(isna = is.na(val)) %>%
dplyr::group_by(key) %>%
dplyr::mutate(total = n()) %>%
dplyr::group_by(key, total, isna)%>%
dplyr::summarise(num.isna = n())%>%
dplyr::mutate(pct = num.isna / total * 100)
levels <- (missing_values2 %>% filter(isna == T) %>% arrange(desc(pct)))$key
#plot percentage of missing values
percentage.plot <- missing_values2 %>%
ggplot() + geom_bar(aes(x = reorder(key, desc(pct)), y = pct, fill=isna), stat = 'identity', alpha=1) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "", values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x = 'Variable', y = "% of missing values")
percentage.plot
#Create table for missing values
missing_values<-sort(sapply(beer_merged, function(x) sum(is.na(x))), decreasing = T)
missing_values %>% kable("html") %>% kable_styling()
| x | |
|---|---|
| IBU | 1005 |
| ABV | 62 |
| Style | 5 |
| beer_name | 0 |
| Beer_ID | 0 |
| Brewery_id | 0 |
| Ounces | 0 |
| brewery_name | 0 |
| City | 0 |
| State | 0 |
The missing values for IBU and ABV do not necessarily represent a zero value. We will replace missing IBU and ABV with the average IBU and average ABV for their respective Style. In the case of Ciders, we have 37 beers in our data set with no IBUs so we will assume these IBUs are actually zero. For the eight remaining styles that have no IBUs reported (Mead, Rauchbier, Kristalweizen, Flanders Red Ale, Shandy, American Malt Liquor, Braggot, Low Alcohol Beer), its unclear if these IBUs are actually non-zero so we will omit these which results in dropping only 15 beers from our data set.
#drop beers with no style
beer_merged <- beer_merged[!is.na(beer_merged$Style),]
replaceNAs<-beer_merged %>%
dplyr::select(Style, ABV, IBU) %>%
dplyr::group_by(Style) %>%
dplyr::summarise(meanABV=mean(ABV, na.rm=TRUE), meanIBU=mean(IBU, na.rm=TRUE))
#replace missing ABV with mean ABV per Style
beer_merged<-left_join(beer_merged, replaceNAs, by="Style")
beer_merged$ABV[is.na(beer_merged$ABV)]=beer_merged$meanABV[is.na(beer_merged$ABV)]
#replace missing IBU with mean IBU per Style
beer_merged$IBU[is.na(beer_merged$IBU)]=beer_merged$meanIBU[is.na(beer_merged$IBU)]
#For Cider IBU=0
beer_merged$IBU[beer_merged$Style=="Cider"]=0
#Drop the remaining styles with no IBUs listed for any beers.
beer_merged<- beer_merged[!is.na(beer_merged$IBU),]
The state with the highest median ABV is Kentucky while the highest IBU is Delaware. In general, the median IBU does not change considerably across States. A couple exceptions to this are New Jersey And Utah which seem to drop off significantly in ABV at the end of the plot. Median IBU has a much larger range across States than ABV. States like Delaware, West Virginia, and Minnesota tend to prefer more bitter beers while states like New Hampshire and Wyoming prefer less bitter. There is no apparent geographical pattern to explain these differences. Further data gathering would need to be done for each state to determine the reason for these differences.
medians = data.frame(beer_merged %>% dplyr::group_by(State) %>%
dplyr::summarize(median_ABV = median(ABV), median_IBU = median(IBU)))
medABVPlot = medians %>% ggplot(aes(x = reorder(State,-median_ABV), y = median_ABV, fill = State )) +
geom_col() +
ggtitle("Median Alcohol Content of Beers by State") +
xlab("State") +
ylab("Median ABV") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
medIBUPlot = medians %>% ggplot(aes(x = reorder(State,-median_IBU), y = median_IBU, fill = State)) +
geom_col() +
ggtitle("Median Bitterness of Beers by State") +
xlab("State") +
ylab("Median IBU") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
ggplotly(medABVPlot)
ggplotly(medIBUPlot)
THe tables below show the top 5 beers in our data set in terms of ABV and IBU. Two of the top 5 IBU beers are located in Boulder, CO but all five beers are a different style. The top 5 beers in terms of ABV are all a variation of an American IPA.
kable(head(beer_merged[order(-beer_merged$ABV),c(1,6,8,9,10,4,3)], n=5), caption = "Top 5 ABV Beers") %>%
kable_styling()
| beer_name | Style | brewery_name | City | State | IBU | ABV | |
|---|---|---|---|---|---|---|---|
| 2274 | Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale | Quadrupel (Quad) | Upslope Brewing Company | Boulder | CO | 24 | 0.128 |
| 71 | London Balling | English Barleywine | Against the Grain Brewery | Louisville | KY | 80 | 0.125 |
| 2185 | Csar | Russian Imperial Stout | Tin Man Brewing Company | Evansville | IN | 90 | 0.120 |
| 2275 | Lee Hill Series Vol. 4 - Manhattan Style Rye Ale | Rye Beer | Upslope Brewing Company | Boulder | CO | 52 | 0.104 |
| 1853 | 4Beans | Baltic Porter | Sixpoint Craft Ales | Brooklyn | NY | 52 | 0.100 |
kable(head(beer_merged[order(-beer_merged$IBU),c(1,6,8,9,10,4,3)], n=5),caption = "Top 5 IBU Beers") %>%
kable_styling()
| beer_name | Style | brewery_name | City | State | IBU | ABV | |
|---|---|---|---|---|---|---|---|
| 148 | Bitter Bitch Imperial IPA | American Double / Imperial IPA | Astoria Brewing Company | Astoria | OR | 138 | 0.082 |
| 2386 | Troopers Alley IPA | American IPA | Wolf Hills Brewing Company | Abingdon | VA | 135 | 0.059 |
| 526 | Dead-Eye DIPA | American Double / Imperial IPA | Cape Ann Brewing Company | Gloucester | MA | 130 | 0.090 |
| 594 | Bay of Bengal Double IPA (2014) | American Double / Imperial IPA | Christian Moerlein Brewing Company | Cincinnati | OH | 126 | 0.089 |
| 2077 | Abrasive Ale | American Double / Imperial IPA | Surly Brewing Company | Brooklyn Center | MN | 120 | 0.097 |
The average ABV of all beers is 5.97% while the median is 5.64% indicating a slightly right-skewed distribution. The histogram below reflects this. Most beers appear to hover around the mean ABV with a few potential outliers with higher ABVs.
summary(beer_merged$ABV)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02700 0.05000 0.05600 0.05974 0.06700 0.12800
beer_merged %>% ggplot(aes(x=ABV, color="Orange")) +
geom_histogram(binwidth=0.0025) +
ggtitle("Distibution of ABV for All beers") +
theme(legend.position = "none")
The scatter plot below shows a positive linear relationship between ABV and IBU for all beers. Our correlation test results produce a correlation value of positive 0.58 on a scale of -1 to 1 is statistically significantly different from zero (p value ~ 0). Furthermore, we are 95% confident that the true correlation value is between 0.55 and 0.61.
ggplot(beer_merged, aes(x=IBU, y= ABV)) +
geom_point(shape=1) +
geom_smooth(method=lm) + # add linear regression line
ggtitle("Correlation between IBU and ABV") +
labs(x="IBU - Bitterness of the beer",y="ABV - Alcoholic content of the beer")
# Pearson's product-moment correlation
df<-unlist(cor.test(beer_merged$ABV,beer_merged$IBU))
data.frame(Correlation_Value = round(as.numeric(unname(df[4])),2), P_Value = round(as.numeric(unname(df[3])),2))
## Correlation_Value P_Value
## 1 0.58 0
Based on the results of this comparison, one can conclude that if a new beer is made and it has an alcohol content between 0.055 and 0.078, with IBU between 60 and 75, that beer will be considerd by most to be an IPA no matter the beer maker decides to name it.
head(beer_merged)
## beer_name Beer_ID ABV IBU Brewery_id
## 1 Pub Beer 1436 0.050 26.75000 409
## 2 Devil's Cup 2265 0.066 44.94118 178
## 3 Rise of the Phoenix 2264 0.071 67.63455 178
## 4 Sinister 2263 0.090 93.32000 178
## 5 Sex and Candy 2262 0.075 67.63455 178
## 6 Black Exodus 2261 0.077 28.09091 178
## Style Ounces brewery_name City State
## 1 American Pale Lager 12 10 Barrel Brewing Company Bend OR
## 2 American Pale Ale (APA) 12 18th Street Brewery Gary IN
## 3 American IPA 12 18th Street Brewery Gary IN
## 4 American Double / Imperial IPA 12 18th Street Brewery Gary IN
## 5 American IPA 12 18th Street Brewery Gary IN
## 6 Oatmeal Stout 12 18th Street Brewery Gary IN
## meanABV meanIBU
## 1 0.05121622 26.75000
## 2 0.05457741 44.94118
## 3 0.06452758 67.63455
## 4 0.08736893 93.32000
## 5 0.06452758 67.63455
## 6 0.05966667 28.09091
df1 = data.frame(ABV = 0.0045 , IBU = 26 )
knn(beer_merged[,3:4], df1, beer_merged$Style, k = 3, prob = TRUE)
## [1] Milk / Sweet Stout
## attr(,"prob")
## [1] 0.3333333
## 100 Levels: Abbey Single Ale Altbier ... Witbier
summary(beer_merged)
## beer_name Beer_ID ABV
## Nonstop Hef Hop : 12 Min. : 1.0 Min. :0.02700
## Dale's Pale Ale : 6 1st Qu.: 805.2 1st Qu.:0.05000
## Oktoberfest : 6 Median :1452.5 Median :0.05600
## Longboard Island Lager: 4 Mean :1429.1 Mean :0.05974
## 1327 Pod's ESB : 3 3rd Qu.:2073.8 3rd Qu.:0.06700
## Boston Lager : 3 Max. :2692.0 Max. :0.12800
## (Other) :2356
## IBU Brewery_id Style
## Min. : 0.00 Min. : 1.0 American IPA : 424
## 1st Qu.: 21.78 1st Qu.: 94.0 American Pale Ale (APA) : 245
## Median : 33.81 Median :206.0 American Amber / Red Ale : 133
## Mean : 40.27 Mean :232.6 American Blonde Ale : 108
## 3rd Qu.: 59.50 3rd Qu.:367.0 American Double / Imperial IPA: 105
## Max. :138.00 Max. :558.0 American Pale Wheat Ale : 97
## (Other) :1278
## Ounces brewery_name City
## Min. : 8.40 Brewery Vivant : 60 Grand Rapids: 64
## 1st Qu.:12.00 Oskar Blues Brewery : 44 Portland : 64
## Median :12.00 Sun King Brewing Company : 38 Chicago : 55
## Mean :13.58 Cigar City Brewing Company: 25 Indianapolis: 43
## 3rd Qu.:16.00 Sixpoint Craft Ales : 24 San Diego : 42
## Max. :32.00 Hopworks Urban Brewery : 23 Denver : 40
## (Other) :2176 (Other) :2082
## State meanABV meanIBU
## Length:2390 Min. :0.04033 Min. : 7.80
## Class :character 1st Qu.:0.05233 1st Qu.:23.25
## Mode :character Median :0.05746 Median :34.18
## Mean :0.05974 Mean :40.90
## 3rd Qu.:0.06453 3rd Qu.:66.67
## Max. :0.10767 Max. :96.00
## NA's :37
set.seed(1000)
iterations = 100
#Change Style to a factor (STYLE HAS TO BE A FACTOR) and IBU/ABV to numeric
beer_merged$Style = as.factor(beer_merged$Style)
beer_merged$ABV<-as.numeric(as.character(beer_merged$ABV))
beer_merged$IBU<-as.numeric(as.character(beer_merged$IBU))
IPAandOther = beer_merged %>%
filter(!is.na(Style) & !is.na(ABV) & !is.na(IBU)) %>%
filter(Style == grep("+IPA", Style, ignore.case = FALSE, value = TRUE) | Style == grep("+Ale", Style, ignore.case = FALSE, value = TRUE))
## Warning in `==.default`(Style, grep("+IPA", Style, ignore.case = FALSE, : longer
## object length is not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(Style, grep("+Ale", Style, ignore.case = FALSE, : longer
## object length is not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
# Plot the difference in IPA and other ales
IPAandOtherPlot = IPAandOther %>%
mutate(Ales = ifelse(Style == grep("+IPA", Style, ignore.case = FALSE, value = TRUE),
"IPA", "Other Ale")) %>%
ggplot(aes(x=IBU, y= ABV, color = Ales)) + geom_point() +
ggtitle("Comparison between IPA and Other Ales")
## Warning in `==.default`(Style, grep("+IPA", Style, ignore.case = FALSE, : longer
## object length is not a multiple of shorter object length
## Warning in `==.default`(Style, grep("+IPA", Style, ignore.case = FALSE, : longer
## object length is not a multiple of shorter object length
ggplotly(IPAandOtherPlot)
trainIndices = sample(1:dim(IPAandOther)[1],round(.7 * dim(IPAandOther)[1]))
train = IPAandOther[trainIndices,]
test = IPAandOther[-trainIndices,]
accs = data.frame(accuracy = numeric(iterations), k = numeric(iterations))
ColoradoData = beer_merged %>% filter(State == "CO")
ColoradoData %>% count(brewery_name, n())
## # A tibble: 45 x 3
## brewery_name `n()` n
## <fct> <int> <int>
## 1 AC Golden Brewing Company 260 2
## 2 Arctic Craft Brewery 260 1
## 3 Asher Brewing Company 260 1
## 4 Aspen Brewing Company 260 2
## 5 Avery Brewing Company 260 6
## 6 Big Choice Brewing 260 4
## 7 Black Shirt Brewing Company 260 4
## 8 Bonfire Brewing Company 260 19
## 9 Boulder Beer Company 260 3
## 10 Breckenridge Brewery 260 4
## # ... with 35 more rows